Session 5: Joins, factors, and pivoting
We’ve used dplyr for many data frame operations. In our last session, we’re trying out three new packages:
Joining synthesizes a new a table from two or more parent tables.
It can be helpful to have these relationships in mind when thinking about different ways we might want to join tables together.
Join type asks, “if I have unpaired rows when I join Tables A and B, what do I do with those rows?”
NA as neededNA as neededNA as neededEach of the above joins might produce a different number of rows.
Consider these two tables:
full_join(): combine all data# A tibble: 7 × 5
preOp_gender preOp_asa preOp_calcBMI preOp_age asa_status
<dbl> <dbl> <dbl> <dbl> <chr>
1 0 2 31.3 80 a patient with mild systemic d…
2 0 1 32.3 58 a normal healthy patient
3 1 3 30.5 65 a patient with severe systemic…
4 0 3 32.2 54 a patient with severe systemic…
5 1 NA 36.3 56 <NA>
6 0 2 28.0 68 a patient with mild systemic d…
7 NA 4 NA NA UNUSED CATEGORY
Note 7 rows: 5 with matches in both tables; one Table A patient with no pre-op ASA status reported; and one Table B unused category.
inner_join(): keep only rows from both tables# A tibble: 5 × 5
preOp_gender preOp_asa preOp_calcBMI preOp_age asa_status
<dbl> <dbl> <dbl> <dbl> <chr>
1 0 2 31.3 80 a patient with mild systemic d…
2 0 1 32.3 58 a normal healthy patient
3 1 3 30.5 65 a patient with severe systemic…
4 0 3 32.2 54 a patient with severe systemic…
5 0 2 28.0 68 a patient with mild systemic d…
Now we have only the five rows that appear in both tables.
left_join(): keep rows from x# A tibble: 6 × 5
preOp_gender preOp_asa preOp_calcBMI preOp_age asa_status
<dbl> <dbl> <dbl> <dbl> <chr>
1 0 2 31.3 80 a patient with mild systemic d…
2 0 1 32.3 58 a normal healthy patient
3 1 3 30.5 65 a patient with severe systemic…
4 0 3 32.2 54 a patient with severe systemic…
5 1 NA 36.3 56 <NA>
6 0 2 28.0 68 a patient with mild systemic d…
Now we have increased to 6 rows, because we kept all of the left table, including the patient with no pre-op ASA status reported.
right_join(): keep rows from y# A tibble: 6 × 5
preOp_gender preOp_asa preOp_calcBMI preOp_age asa_status
<dbl> <dbl> <dbl> <dbl> <chr>
1 0 2 31.3 80 a patient with mild systemic d…
2 0 1 32.3 58 a normal healthy patient
3 1 3 30.5 65 a patient with severe systemic…
4 0 3 32.2 54 a patient with severe systemic…
5 0 2 28.0 68 a patient with mild systemic d…
6 NA 4 NA NA UNUSED CATEGORY
6 rows again, but this time we have dropped the nonreporting patient, and we see instead the unused category.
semi_join(), anti_join(): filter x vs. yFiltering joins decide which rows to keep from x, based on the presence (or absence) of the value in y
Example: you applied inclusion criteria to get a list of patient ID’s to include in analysis, and now you would like to filter just those patients’ records into a data frame.
inclusion_ids <- tibble(subject_id = c(9134, 663, 5408))
covid_testing %>%
semi_join(y = inclusion_ids, by = join_by(subject_id))# A tibble: 3 × 17
subject_id fake_…¹ fake_…² gender pan_day test_id clini…³ result demo_…⁴ age
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 9134 grunt rivers male 7 covid clinic… negat… patient 0.8
2 663 ithoke targar… male 9 covid clinic… negat… patient 0.8
3 5408 alia ryswell female 10 covid clinic… negat… patient 0.9
# … with 7 more variables: drive_thru_ind <dbl>, ct_result <dbl>,
# orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
# rec_ver_tat <dbl>, and abbreviated variable names ¹fake_first_name,
# ²fake_last_name, ³clinic_name, ⁴demo_group
Typical things you’ll need to do with factors:
as_factor()fct_recode()fct_collapse()[1] A B C B A B
Levels: A B C
[1] A beta C beta A beta
Levels: A beta C
[1] Other Other C Other Other Other
Levels: Other C
[1] A B C B A B
Levels: B A C
value column, and each year into a new year columnpivot_longer()cols = the year columns (`2013`:`2019`)names_to =new column where you want the years recorded (year)values_to =new column where you want the values recorded (births)Before pivoting
# A tibble: 52 × 8
# Groups: State [52]
State `2013` `2014` `2015` `2016` `2017` `2018` `2019`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaba… 10784 10018 9478 8960 8482 7848 7910
2 Alaska 1416 1290 1324 1166 972 846 786
3 Arizo… 14464 13244 11820 10714 10050 9300 8636
4 Arkan… 8310 7564 7354 6744 6356 5856 5764
5 Calif… 61010 54050 48350 42824 37870 33858 30712
6 Color… 7668 6754 6540 6136 5580 5044 4942
7 Conne… 3212 2840 2482 2272 2106 1976 1804
8 Delaw… 1456 1232 1080 1166 1104 994 888
9 Distr… 1274 1130 1002 920 816 746 646
10 Flori… 27924 25632 23914 22390 21416 19658 19082
# … with 42 more rows
Data source: CDC
After pivoting longer
# A tibble: 364 × 3
# Groups: State [52]
State year births
<chr> <chr> <dbl>
1 Alabama 2013 10784
2 Alabama 2014 10018
3 Alabama 2015 9478
4 Alabama 2016 8960
5 Alabama 2017 8482
6 Alabama 2018 7848
7 Alabama 2019 7910
8 Alaska 2013 1416
9 Alaska 2014 1290
10 Alaska 2015 1324
# … with 354 more rows
pivot_wider()names_from, values_fromThe ggplot2 stat_function() layer allows you to plot various statistical functions (such as a theoretical distribution). Note that for the two layers to be compatible, we transform the histogram’s y to density. dnorm() is the base R function for the Normal distribution.
annotate() is an additional ggplot layer you can add to plots:
ggplot2::ggsave() allows us to save plot objects in a variety of formats and parameters. Here are example calls:
# scalable vector graphics (SVG) format:
ggsave(filename="num12m-qq.svg", plot=num12m_qqp)
# 500 x 400 px PNG file, 100dpi:
ggsave(filename="num12m-qq.png", plot=num12m_qqp,
width=500, height=400, units="px", dpi=100)
# 4 in. x 6 in. PDF, 300dpi:
ggsave(filename="num12m-qq.pdf", plot=num12m_qqp,
width=4, height=6, units="in")
# note that this plot will be in a portrait formatsummary() also usable)t.test()chisq.test() (note: expects a matrix for 2-way)lm() with predict()These functions can be compatible with tidyverse work (but not %>% )
Are mean 12-month counts significantly different between sulindac (treatment) and placebo groups?
[1] NA 2 17 1 25 3 33 NA 3 1 4
[1] 63 28 61 7 15 44 28 10 40 46 50
Welch Two Sample t-test
data: sulindac12m and placebo12m
t = -3.6114, df = 16.901, p-value = 0.002172
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-40.79597 -10.69898
sample estimates:
mean of x mean of y
9.888889 35.636364
Is there a relationship between baseline and 12-month counts?
Call:
lm(formula = number12m ~ baseline, data = polyps)
Coefficients:
(Intercept) baseline
19.1647 0.1113
What 12-month count does this model predict for a baseline of 50? of 100?
t_test())
specify() variable/relationship of interesthypothesize() the null hypothesisgenerate() (or simulate if theoretical) the null distributioncalculate() a test statistic like Chisq, F, t, or zAre mean 12-month counts significantly different between sulindac (treatment) and placebo groups?
Is there a relationship between baseline and 12-month counts?
# parsnip::linear_reg()
polyps_model <- linear_reg() %>%
set_engine("lm") %>%
fit(number12m ~ baseline, data = polyps)
polyps_modelparsnip model object
Call:
stats::lm(formula = number12m ~ baseline, data = data)
Coefficients:
(Intercept) baseline
19.1647 0.1113
What 12-month count does this model predict for a baseline of 50? of 100?
geom_smooth)Run predict() on a data frame of your explanatory values. Then bind the predicted column back with the original data frame and plot the predictions with their own geom_line().
baseline_df <- polyps %>%
select(baseline)
predicted_df <- predict(polyps_model, new_data = baseline_df) %>%
rename(predicted_12m = ".pred") %>%
mutate(predicted_12m = round(predicted_12m, 1))
predicted_df# A tibble: 22 × 1
predicted_12m
<dbl>
1 19.9
2 27.7
3 19.9
4 19.7
5 21.7
6 23.1
7 20.4
8 20.5
9 19.9
10 54.6
# … with 12 more rows
We learned about:
Always feel free to reach out! dbordelon@pitt.edu
Summer R 5: Joins, factors, and pivoting